DETAILED ACTION 

Response to Arguments 

The examiner notes applicant's request to consider EP-1 422 692 A2, JP- 
361285570A and U.S. Patent 6,748,350. EP-1 422 692 A2 and JP-361285570A, which 
are on the 1449 dated April 9, 2007, were overlooked, but have now been considered. 
The applicant will receive an updated 1449 with this communication. The examiner also 
notes that U.S. Patent 6,748,350, listed on the July 2, 2007 IDS, is titled "Method to 
compensate for stress between heat spreader and thermal interface material", and has 
no bearing on the subject matter of the instant application; the U.S. patent was crossed 
out on the 1449, since it cannot be considered prior art. 

Applicant's arguments with respect to the objection to the specification and the 
112, 2nd paragraph rejection of claim 6 are persuasive; therefore the objection and 
rejection are withdrawn. 

Applicant also argues that, "KUBALA and SIEGLER, do not disclose or suggest 
segmenting an input audio stream into predetermined length intervals such that portions 
of the intervals overlap one another, as recited in amended claim 1", concluding that, 
"KUBALA does not disclose that the input frames are predetermined length intervals. 
KUBALA does not disclose or suggest anything about the length of the input frames. 
Therefore, KUBALA cannot disclose or suggest segmenting the input audio stream into 
predetermined intervals such that portions of the intervals overlap one another, as 
recited in amended claim 1" (Remarks page 12 and 13); however the examiner 
respectfully disagrees. Kubala performs speaker segmentation of input speech, where 
frames (intervals) are initially labeled as speech or non-speech. In any speech 
processing system, input speech cannot be segmented into frames without first defining 
the length of that frame (interval). Since the system of Kubala processes speech 
frames (intervals), it is therefore inherent that the frames were first created as 
predetermined intervals of the incoming speech. 

Applicant's arguments with respect to the rejection of claim 3 using Siegler, now 
incorporated into claim 1, have been considered but are moot in view of the new 
ground(s) of rejection. 

Claim Objections 

Claim 26 is objected to because of the following informalities: Claim 26 is 
dependent from cancelled claim 25. The examiner considers this a typographical error, 
and therefore interprets claim 26 as dependent from independent claim 23. Appropriate 
correction is required. 

Claim Rejections - 35 USC § 103 

The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

Claims 1, 7, 11, 14, 23, 24 and 30 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Kubala ("Integrated Technologies for Indexing Spoken Language" 
ACM 2000) in view of Aversano ("A New Text-Independent Method for Phoneme 
Segmentation" IEEE 2001). 

As per amended claim 1, Kubala discloses a method for detecting speaker changes in 
an input audio stream comprising: 

Segmenting the input audio stream into predetermined length intervals (page 53, 
first paragraph, the speech is input as frames); 

Decoding the intervals to produce a set of phones corresponding to each of the 
intervals (page 53, phone class recognition is performed on each frame. Therefore it is 
inherent that a set of phones was decoded for each frame); 

Generating a similarity measurement based on a first portion of the audio stream 
that is within one of the intervals and that occurs prior to a boundary between adjacent 
phones in one of the intervals and a second portion of the audio stream that is within the 
one of the intervals and that occurs after the boundary (page 53, second paragraph, 
speaker change is hypothesized at every phone boundary using a form of a likelihood 
ratio test (similarity score))); 

Detecting speaker changes based on the similarity measurement (page 53, 
second paragraph, speaker change is hypothesized at every phone boundary using a 
form of a likelihood ratio test (similarity score))); and 

Outputting an indication of the detected speaker changes (page 49, the system 
outputs a summary which indicates the location of speaker changes within the 
recording). 

Kubala does not disclose segmenting the input stream into predetermined 
intervals such that the portions of the intervals overlap one another. However, 
segmenting speech into overlapping intervals is known, as indicated in Aversano. 
Aversano discloses a system that segments an input speech signal into 20 ms frames 
(intervals), with 10ms of frame-overlap (page 516, section 2, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to apply the known technique of segmenting an input stream into 
overlapping predetermined intervals in Kubala, since it would produce the predictable 
result of improving the detection of speech, including speaker, changes at the 
boundaries of the frames (intervals). 
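
For illustration only, the overlapping segmentation attributed to Aversano (20 ms frames 
with 10 ms of frame-overlap) can be sketched as follows. The function name 
`overlapping_frames`, the 16 kHz sample rate, and the choice to drop a trailing partial 
frame are assumptions made for this sketch; none of them appears in Kubala or Aversano. 

```python
def overlapping_frames(samples, sample_rate_hz=16000, frame_ms=20, overlap_ms=10):
    """Split a sequence of samples into fixed-length frames that overlap.

    Assumed defaults: 16 kHz audio, 20 ms frames, 10 ms overlap.
    A trailing partial frame is dropped (a simplification for this sketch).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    hop = sample_rate_hz * (frame_ms - overlap_ms) // 1000
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame length must be positive and exceed the overlap")
    last_start = len(samples) - frame_len
    # Each frame starts one hop after the previous one, so consecutive
    # frames share (frame_len - hop) samples.
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


# 50 ms of dummy audio at the assumed 16 kHz rate.
frames = overlapping_frames(list(range(800)))
```

With these assumed values each frame holds 320 samples and starts 160 samples after the 
previous one, so the second half of one frame is the first half of the next. 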

As per original claim 7, Kubala in view of Aversano disclose the method of claim 1, and 
Kubala further discloses wherein the decoded set of phones is selected from a 
simplified corpus of phone classes (page 53, first paragraph, speech is labeled using 
speech and non-speech models). 

As per amended claim 11, Kubala discloses a device for detecting speaker changes in 
an audio signal, the device comprising: 

A processor (page 52, second column, first paragraph, the Pentium II processor); 

A memory (page 52, the system is run on a computer using a Pentium II 
processor, therefore it is inherent that there are instructions stored in memory) containing 
instructions that when executed by the processor cause the processor to: 

Segment the audio signal into predetermined length intervals (page 53, first 
paragraph, the speech is input as frames), 

Decode the intervals to produce a set of phones corresponding to each of the 
intervals (page 53, phone class recognition is performed on each frame. Therefore it is 
inherent that a set of phones was decoded for each frame), 

Generate a similarity measurement based on a first portion of the audio signal 
that occurs prior to a boundary between phones in one of the sets of phones of an 
interval and a second portion of the audio signal that occurs after the boundary (page 
53, second paragraph, speaker change is hypothesized at every phone boundary using 
a form of a likelihood ratio test (similarity score))), and 

Detect speaker changes based on the similarity measurement boundary (page 
53, second paragraph, speaker change is hypothesized at every phone boundary using 
a form of a likelihood ratio test (similarity score))), and 

Store an indication of the detected speaker changes (page 49 and 52, and Figure 
5, the structural summarization, including the locations of speakers in a recording, is 
stored in an XML file in the Indexer subsystem). 

Kubala does not disclose segmenting the input stream into predetermined 
intervals such that the portions of the intervals overlap one another. However, 
segmenting speech into overlapping intervals is known, as indicated in Aversano. 
Aversano discloses a system that segments an input speech signal into 20 ms frames 
(intervals), with 10ms of frame-overlap (page 516, section 2, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to apply the known technique of segmenting an input stream into 
overlapping predetermined intervals in Kubala, since it would produce the predictable 
result of improving the detection of speech, including speaker, changes at the 
boundaries of the frames (intervals). 

As per original claim 14, this claim contains limitations that are similar to the limitations 
cited in claim 7, and is rejected for similar reasons. 

As per previously presented claim 23, Kubala discloses a system comprising: 

An indexer configured to receive input audio data and generate a rich 
transcription from the audio data, the rich transcription including metadata that defines 
speaker changes in the audio data (page 49 and 52, and Figure 5, the Indexer 
subsystem creates a structural summarization, including a transcript indicating locations 
of speakers in a recording), the indexer including: 

A segmentation component configured to divide the audio into segments of a 
predetermined length (page 53, first paragraph, the speech is input as frames), 

A speaker change detection component configured to detect locations of speaker 
changes in the audio data based on a similarity value calculated at locations in the 
segments that correspond to phone class boundaries (page 53, second paragraph, 
speaker change is hypothesized at every phone boundary using a form of a likelihood 
ratio test (similarity score)); 

A memory system for storing the rich transcription (page 49 and 52, and Figure 
5, the structural summarization, including the locations of speakers in a recording, is 
stored in an XML file in the Indexer subsystem); and 

A server configured to receive requests for documents and to respond to the 
requests by transmitting ones of the rich transcriptions that match the requests (page 
55, Information Retrieval, information indexing and retrieval take place on the 
Rough'n'Ready server). 

Kubala does not disclose a segmentation component configured to divide the 
audio data into overlapping segments of a predetermined length. However, segmenting 
speech into overlapping intervals is known, as indicated in Aversano. Aversano 
discloses a system that segments an input speech signal into 20 ms frames (intervals), 
with 10ms of frame-overlap (page 516, section 2, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to apply the known technique of segmenting an input stream into 
overlapping predetermined intervals in Kubala, since it would produce the predictable 
result of improving the detection of speech, including speaker, changes at the 
boundaries of the frames (intervals). 

As per original claim 24, Kubala in view of Aversano disclose the system of claim 23, and 
Kubala further discloses wherein the indexer further includes at least one of: a speaker 
clustering component, a speaker identification component, a name spotting component, 
and a topic classification component (pages 52-55 and Figure 5). 

As per amended claim 30, Kubala discloses a device comprising: 

Means for segmenting the input audio stream into predetermined length intervals 
(page 53, first paragraph, the speech is input as frames); 

Means for decoding the intervals to produce a set of phones corresponding to 
each of the intervals (page 53, phone class recognition is performed on each frame. 
Therefore it is inherent that a set of phones was decoded for each frame); 



Application/Control Number: 10/685,586 Page 10 

Art Unit: 2626 

Means for generating a similarity measurement based on audio within one of the 
intervals that is prior to a boundary between adjacent phones and based on audio within 
the one of the intervals that is after the boundary (page 53, second paragraph, speaker 
change is hypothesized at every phone boundary using a form of a likelihood ratio test 
(similarity score))); 

Means for detecting speaker changes based on the similarity measurement 
(page 53, second paragraph, speaker change is hypothesized at every phone boundary 
using a form of a likelihood ratio test (similarity score))); and 

Means for outputting the detected speaker changes (page 49 and 52, and Figure 
5, the structural summarization, including the locations of speakers in a recording, is 
stored in an XML file in the Indexer subsystem). 

Kubala does not disclose segmenting the input stream into predetermined 
intervals such that the portions of the intervals overlap one another. However, 
segmenting speech into overlapping intervals is known, as indicated in Aversano. 
Aversano discloses a system that segments an input speech signal into 20 ms frames 
(intervals), with 10ms of frame-overlap (page 516, section 2, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to apply the known technique of segmenting an input stream into 
overlapping predetermined intervals in Kubala, since it would produce the predictable 
result of improving the detection of speech, or speaker, changes at the boundaries of 
the frames (intervals). 

Claims 2, 12 and 26 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Kubala in view of Aversano, and further in view of Beigi ("A Distance Measure 
Between Collections of Distributions and Its Application to Speaker Recognition" IEEE 
1998). 

As per original claim 2, Kubala in view of Aversano disclose the method of claim 1, 
however neither disclose wherein the predetermined length intervals are approximately 
thirty seconds in length. Beigi discloses the use of intervals that are approximately thirty 
seconds long in a speech recognition system (page 756, Section 4 Results, 30 seconds 
of speech is used for training). Beigi discloses a system that models speech for 
different speakers as two sets of statistical distributions. A meaningful distance measure 
between the two distributions is calculated, which can then be used for speaker 
classification, speech segmentation and speaker verification. Beigi also uses thirty 
seconds of speech data as enrollment data. Therefore, the examiner argues that it is old 
and well known to segment audio into predetermined intervals approximately thirty 
seconds long. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to use predetermined length intervals approximately thirty seconds in 
length in Kubala, since an interval of that length would provide robust data for training a 
speaker segmentation model. 

As per original claim 12, this claim contains limitations that are similar to the limitations 
cited in claim 2, and is rejected for similar reasons. 

As per original claim 26, this claim contains limitations that are similar to the limitations 
cited in claim 2, and is rejected for similar reasons. 

Claims 4-6, 8-10, 15-18, 21, 22 and 27-29 are rejected under 35 U.S.C. 103(a) as 
being unpatentable over Kubala in view of Aversano, and further in view of Liu ("Fast 
Speaker Change Detection for Broadcast News Transcription and Indexing" 1999). 

As per previously presented claim 4, Kubala in view of Aversano disclose the method 
of claim 1, however neither disclose wherein generating a similarity measurement 
includes: calculating cepstral vectors for the audio stream prior to the boundary and 
after the boundary, and comparing the cepstral vectors. Liu discloses a system for 
speaker change detection that calculates cepstral vectors for the audio stream prior to 
the boundary and after the boundary, and compares the cepstral vectors (Section 4 
Speaker Change Detection, subsection Distance Measure Criterion, cepstral vectors 
are used in the distance measure (similarity measure)). In addition, cepstral vectors are 
one of many vector types used to represent speech features in speech processing 
tasks. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to calculate cepstral feature vectors for the audio stream prior to and 
after the boundary in Kubala and Aversano, since one of ordinary skill in the art has 
good reason to pursue the options within his or her technical grasp to achieve the 
predictable result of calculating robust and reliable feature vectors. 

As per original claim 5, Kubala in view of Aversano, further in view of Liu disclose the 
method of claim 4, however neither Kubala nor Aversano disclose wherein the cepstral 
vectors are compared using a generalized likelihood ratio test. However, Kubala does 
disclose calculating a similarity measure using the Likelihood ratio test. In addition, Liu 
discloses wherein the cepstral vectors are compared using a generalized likelihood ratio 
test (Section 4 Speaker Change Detection, subsection Distance Measure Criterion). 
Cepstral vectors are one of many vector types used to represent speech features in 
speech processing tasks. 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to compare the cepstral vectors using a generalized likelihood ratio test 
in Kubala and Aversano, since one of ordinary skill in the art has good reason to 
pursue the options within his or her technical grasp to achieve the result of calculating 
robust and reliable feature vectors. 

As per original claim 6, Kubala in view of Aversano, further in view of Liu disclose the 
method of claim 5, but neither Kubala nor Aversano disclose wherein a speaker 
change is detected when the generalized likelihood ratio test produces a value less than 
a preset threshold. However, the generalized likelihood ratio test is a standard method 
used to detect abrupt changes in a non-stationary signal, where an optimized likelihood 
forms a decision function that is compared to a threshold; a change point is indicated 
when the value exceeds a threshold. Liu discloses a system that uses the GLR (Section 
4 Speaker Change Detection, subsection the critical region). The generalized likelihood 
function used in the instant application is simply the reciprocal of the GLR commonly 
used, however its function is the same, i.e. indicating a change point in a non-stationary 
signal. 

Therefore it would have been obvious to one of ordinary skill in the art at the time of 
the invention to use the generalized likelihood ratio test to indicate a change when the 
value is less than a threshold in Kubala and Aversano, since one of ordinary skill in the 
art has good reason to pursue the options within his or her technical grasp in order to 
make a reliable decision as to when a speaker change has occurred in input audio data. 
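
As an illustrative sketch only, the relationship between a likelihood ratio and its 
reciprocal can be shown with a one-dimensional Gaussian model. The function names, the 
univariate Gaussian assumption, the sample values and the threshold are hypothetical and 
are not taken from Liu, Kubala or the instant application; an actual system would 
compare multi-dimensional cepstral vectors. 

```python
import math
from statistics import pvariance


def log_glr(before, after):
    """Log of L(pooled) / (L(before) * L(after)) with ML Gaussian fits.

    The value is at most 0 (up to rounding). Values near 0 suggest one
    source; strongly negative values suggest a change at the boundary.
    Each segment needs non-zero variance.
    """
    pooled = list(before) + list(after)

    def half_n_log_var(segment):
        return 0.5 * len(segment) * math.log(pvariance(segment))

    # Constant terms of the Gaussian log-likelihoods cancel because
    # len(pooled) == len(before) + len(after).
    return half_n_log_var(before) + half_n_log_var(after) - half_n_log_var(pooled)


def change_below_threshold(before, after, log_threshold):
    """Ratio form: a change is flagged when the ratio falls below the threshold."""
    return log_glr(before, after) < log_threshold


def change_above_threshold(before, after, log_threshold):
    """Reciprocal form: a change is flagged when 1/ratio exceeds 1/threshold."""
    return -log_glr(before, after) > -log_threshold


same_a = [0.0, 0.1, -0.1, 0.05]
same_b = [0.02, -0.05, 0.1, -0.08]
shifted = [5.0, 5.1, 4.9, 5.05]
print(change_below_threshold(same_a, same_b, -5.0),
      change_below_threshold(same_a, shifted, -5.0))
```

Because the second test is the first with both sides negated, the two forms reach the 
same decision on any input; only the direction of the comparison differs. 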

As per original claim 8, Kubala in view of Aversano, disclose the method of claim 7, 
however neither disclose wherein the simplified corpus of phone classes includes a 
phone class for vowels and nasals, a phone class for fricatives, and a phone class for 
obstruents. Liu discloses wherein the simplified corpus of phone classes includes a 
phone class for vowels, nasals, fricatives and obstruents (Section 3 Phone-Class 
Decode, Figure 1). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have a corpus of phone classes that include a phone class for vowels 
and nasals, a phone class for fricatives, and a phone class for obstruents in Kubala and 
Aversano, since vowels and nasals are similar in that they both have pitch and high 
energy, and can therefore be combined to significantly speed up processing, as 
indicated in Liu (section 3). 

As per original claim 9, Kubala in view of Aversano, further in view of Liu disclose the 
method of claim 8, and Kubala further discloses wherein the simplified corpus of phone 
classes further includes a phone class for music, laughter, breath and lip-smack (page 
53, first paragraph). However, neither Kubala nor Aversano explicitly disclose wherein 
the simplified corpus of phone classes further includes a phone class for silence. Liu 
discloses wherein the simplified corpus of phone classes further includes a phone class 
for silence (section 3). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have a phone class for silence in Kubala and Aversano, since 
non-speech events possess valuable information about speaker changes, and can be used to 
more accurately determine speaker changes, as indicated in Liu (section 3, first 
paragraph). 

As per original claim 10, Kubala in view of Aversano disclose the method of claim 7, 
however neither disclose wherein the simplified corpus of phone classes includes 
approximately seven phone classes. Liu discloses wherein the simplified corpus of 
phone classes includes approximately seven phone classes (section 3). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have a corpus of phone classes including approximately seven phone 
classes in Kubala and Aversano, since it reduces the number of active nodes during 
decoding, and thus speeds up processing, as indicated in Liu (section 3, last 
paragraph). 

As per original claim 15, this claim has limitations similar to claim 8, and is therefore 
rejected for similar reasons. 

As per original claim 16, this claim has limitations similar to claim 9, and is therefore 
rejected for similar reasons. 

As per original claim 17, this claim has limitations similar to claim 10, and is therefore 
rejected for similar reasons. 

As per amended claim 18, Kubala discloses a device for detecting speaker changes in 
an audio signal, the device comprising: 

A segmentation component configured to segment the audio signal into 
predetermined length intervals (page 53, first paragraph, the speech is input as frames); 

A phone classification decode component configured to decode the intervals to 
produce a set of phone classes corresponding to each of the intervals (page 53, phone 
class recognition is performed on each frame. Therefore it is inherent that a set of 
phones was decoded for each frame); and 

A speaker change detection component configured to detect locations of speaker 
changes in the audio signal based on a similarity value calculated over a first portion of 
the audio signal that occurs prior to a boundary between phone classes in one of the 
intervals and a second portion of the audio signal that occurs after the boundary in the 
one of the intervals boundary (page 53, second paragraph, speaker change is 
hypothesized at every phone boundary using a form of a likelihood ratio test (similarity 
score))); 

Wherein an indication of the detected locations of speaker changes is output 
from the device (page 49, the system outputs a summary which indicates the location of 
speaker changes within the recording). 

Kubala does not disclose segmenting the input stream into predetermined 
intervals such that the portions of the intervals overlap one another. However, 
segmenting speech into overlapping intervals is known, as indicated in Aversano. 
Aversano discloses a system that segments an input speech signal into 20 ms frames 
(intervals), with 10ms of frame-overlap (page 516, section 2, second paragraph). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to apply the known technique of segmenting an input stream into 
overlapping predetermined intervals in Kubala, since it would produce the predictable 
result of improving the detection of speech, or speaker, changes at the boundaries of 
the frames (intervals). 

Kubala also does not disclose wherein the number of possible phone classes 
is approximately seven. Liu discloses wherein the simplified corpus of phone 
being approximately seven. Liu discloses wherein the simplified corpus of phone 
classes includes approximately seven phone classes (section 3). 

Therefore it would have been obvious to one of ordinary skill in the art at the time 
of the invention to have a corpus of phone classes including approximately seven phone 
classes in Kubala and Aversano, since it reduces the number of active nodes during 
decoding, and thus speeds up processing, as indicated in Liu (section 3, last 
paragraph). 

As per original claim 21, this claim contains limitations similar to those on claim 8, and is 
therefore rejected for similar reasons. 

As per original claim 22, this claim contains limitations similar to those on claim 9, and is 
therefore rejected for similar reasons. 

As per original claim 27, this claim contains limitations similar to those on claim 8, and is 
therefore rejected for similar reasons. 

As per original claim 28, this claim contains limitations similar to those on claim 9, and is 
therefore rejected for similar reasons. 

As per original claim 29, this claim contains limitations similar to those on claim 10, and 
is therefore rejected for similar reasons. 

Claim 19 is rejected under 35 U.S.C. 103(a) as being unpatentable over Kubala 
in view of Aversano, in view of Liu as applied to claim 18 above, and further in view of 
Beigi. 

As per claim 19, this claim contains limitations similar to those on claims 2 and 12, and 
is therefore rejected for similar reasons. 

Conclusion 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Dorothy Sarah Siedler whose telephone number is 571- 
270-1067. The examiner can normally be reached on Mon-Thur 9:30am-5:30pm. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Richemond Dorvil, can be reached on 571-272-7602. The fax phone 
number for the organization where this application or proceeding is assigned is 571- 
273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

DSS 

3/26/2008 



/Talivaldis Ivars Smits/ 
Primary Examiner, Art Unit 2626 



